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Domain generation algorithm (DGA) is used as the main source of script in 
different groups of malwares, which generates the domain names of points 
and will further be used for command-and-control servers. The security 
measures usually identify the malware but the domain name algorithms will 
be updating themselves in order to avoid the less efficient older security 


detection methods. The reason being the older detection methods does not 
use either the machine learning or deep learning algorithms to detect the 
DGAs. Thus, the impact of incorporating the machine learning and deep 
learning techniques to detect the DGA is well discussed. As a result, they 
can create a huge number of domains to avoid debar and henceforth, block 
the hackers and zombie systems with the older methods itself. The main 
purpose of this research work is to compare and analyse by implementing 
various machine learning algorithms that suits the respective dataset yielding 
better results. In this research paper, the obtained dataset is pre-processed 
and the respective data is processed by different machine learning algorithms 
such as random forest (RF), support vector machine (SVM), Naive Bayes 
classifier, H20 AutoML, convolutional neural network (CNN), long short- 
term memory neural network (LSTM) for the classification. It is observed 
and understood that the LSTM provides a better classification efficiency of 
98% and the H20 AutoML method giving the least efficiency of 75%. 
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1. INTRODUCTION 

The internet is widely used and it has high standard security strategy team to identify the domain 
generation algorithm (DGA) traffic through older methods. Also, the security team will be providing a huge 
list of documents as to generate the list of domains for potential C2 traffic. Then the method they follow for 
finding the domain groups of the DGA algorithms are using more statistical properties of the DGA. The main 
drawback of the older methods is not used for protecting the system from recent domains and more on time 
detection. In this work, a technique to detect randomly generated domains using machine learning algorithm 
model [1] such as support vector machine (SVM), AutoML: H20, Naive Bayes classifier and random forest 
(RF), is being presented. Machine learning algorithms such as supervised learning algorithms, namely 
random forests (RFs) for decision making, SVM to process the labelled dataset predict the optimal hyper 
plane, thereby categorizing data. The classification is based on the structural, linguistic and statistical features 
of the respective domains. The second stage drawback with the machine learning algorithm is the “hand- 
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crafted features” which have derived variables, covariates, features being predictable by intruders and their 
time complexities in real time detection. Henceforth, to overcome this drawback, “learned-features” 
implementation is made using deep learning algorithms, to achieve better performances supported by deep 
learning algorithms such as long short-term memory neural network (LSTM) and convolutional neural 
network (CNN). As the second phase of the work, the dataset would be measured for the efficiency metrics 
with standard parameters amongst the entire set of proposed algorithms. Final phase of the work describes 
the solution to a better scheme of algorithms that could be used to detect the malicious variant. 


2. LITERATURE SURVEY 

The related works of the identification of DGA botnets that have been attempted using different 
technologies have been discussed in this section. The purpose remains where the use of recent technologies 
detects the pseudo random domain names tries to connect to the command and connect (C2) server. The work 
of Vinayakumar ef al. [2] have given insights on how to deal with DGA Botnets using deep learning and 
machine learning algorithms, which alternates the idea of blacklisting the domain names which is a non real- 
time statistical machine learning approaches. Deep learning methods, resembling classical machine learning 
methods, suggested in their work leverages detection on per domain bases, where feature engineering is not 
used and circumventing is not possible. 

Woodbridge et al. [3] have solutions from domain name system (DNS) query blacklisting is that and 
real time detection, with DGA classifiers and leverage the long short-term memory (LSTM). The work 
provides an in-depth analysis of the classifier functional interpretability at each layer. The data training set 
remains the key for the performance metrics of the detection, where best results of classification are deployed 
at the easiest possible. 

Zhou et al. [4], proposed a general system to detect the DGA with a new model with high coverage. 
This helps to understand the algorithms used in high range accuracy detection. The word level and character 
level analysis done using deep learning algorithm (convolution neural network). Results of the paper 
concludes that the work to categorize the domains into two or more classifier. 

Sharifnya and Abadi [5] planned a DGA grounded botnet detection algorithm by grouping the DNS 
queries of the host and also try to test it. the calculation helps in the understanding the possibilities of the 
hosts be a Botnet. Zhang et al. [6] used the NXDomain traffic clustering, classification of string features, and 
other methods that are frequently used such as number and alphabet domain classifier. The neural network 
has the entropy, bigram and length detections. it is layered neural network approach achieves 94% 
experimental results. 

Understanding the algorithms with the help of their research-based works was essential in the 
project. Breiman [7] had contributed for the RF as a combination of prediction trees, where each tree of the 
algorithm works independently and with unique random vector samples. Contribution for the classification of 
malware, RF algorithms are trained with different data set and are unique in its training and correlates 
distinctively, that earns better results. 

Ren et al. [8] related Naive Bayes classification to uncertain data without much trained dataset. The 
results have been shown that the prediction is far more unique than the theoretical approaches. Yeo ef al. [9] 
have successfully achieved a high accuracy in their malware detection, where they have used CNN, SVM, 
RF, multi-layered perceptron (MLP). The high range of accuracy was achieved only due to the use of 35 
features extracted from the packet supervision, rather than focusing on the IPs and the ports. The overall 
survey works are more insightful as the work of Idika and Mathur [10] suggests techniques, samples and 
have also proposed a classification method, which were created after understanding the short-comings of the 
signature-based, specification based and anomaly based detection methods. The work finally suggested that 
commercial-off-the-shelf (COTS) malware detector is easier to obfuscate. 


3. METHODS 

Domain name generation algorithm (DGA) is a botnet malware that is responsible for a continuous 
communication between the intruder and the bots. The practical challenges faced are majorly on the false 
positives of the malware distribution that certainly reduces the accuracy of intrusion detection and several 
limitations. Domain generation have been intercepted with different techniques, where the challenge lies with 
the real-time detection and security. Lack of real-time security is the disadvantages of the existing 
algorithms. 

Hence, its approach with the methods of machine learning and deep learning concepts uses 
automation NXDomain [11] classification and intelligence. It uses two supervised learning algorithms such 
as RFs and SVM. These two said algorithms normally utilizes the structural and self-structural features for 
detecting the domain data. When the ancient methods are used in finding the malicious codes, hackers just 
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change the custom code to bypass the security strategy model. This is the reason, deep learning and neural 
network approaches are brought in and considered and therefore it acts like a firewall so that it is very hard 
for the hackers to discharge this. All the learning algorithms will be using three different datasets out of those 
two datasets will be malicious and one as the group of good and bad domains as shown in Table 1. 


Table 1. DGA family classification 


DGA_Family Domain Type 
none prat.pt Normal 
banjori bxjofordlinnetavox.com DGA 
emotet tbaccrnxirtmuusq.eu DGA 
rovnix fbo6fssycmvf16nb47.net DGA 
none giftcardsinfol .icu Normal 


3.1. Data representation 

Data is represented through data domains that have been used as non-structural data. It is not like 
structural data for which it does not have any rules and regulations. In this paper work, the machine learning 
and deep learning techniques for analyzing the data are discussed where these two different approaches 
utilize the old-styled machine learning methods. Here, the respective algorithm transforms the data to a 
complete structural data and thus the deep learning uses the same uni-structural data for the brain processing 
methods. This processing method is known as the artificial neural networks that processes different steps of 
dataset. 


3.2. Feature engineering 

Machine learning is used for the attribute domain and that is not sufficient in this case. It needs more 
definite feature sets for which it requires the knowledge and the respective references for further processing. 
The features are mainly classified as: structural features shown in Table 2, linguistic features shown in 
Table 3, statistical features shown in Table 4. 


Table 2. Structural features 


Structures Ex: Hamata.pt Ex: husjnshdj.eu 
Domain names vi 20 
Nos 1 2 
Length of mean 3.0 9.0 
prefixes 0 0 


Table 3. Linguistic features 


Structures Ex: huskak.pt _ Ex.ppposft.eu 
Number of digits 0 0 
Ratio of vowels 0.4 0.25 
Digits ratio and vowels 0.33 0.0 


Table 4. Statistical features 


Structures Ex.hshsa.pt _ Ex.jdbsjhss.eu 
Frequent letters in names ratio 0.3 0.4 
Successive letters ratio 0.5, 0.725 
Uninterrupted digits ratio 0 0 
Change in the names 2.34 3.6 


3.3. DGA detector system 

The proposed DGA Botnet detector system is the model that is a hybrid culmination of the selective 
machine learning and deep learning algorithms described in the upcoming sections. The overall model 
Figure 1 comprises of these algorithms as a system, where the algorithms are trained using the similar dataset 
and the results are correlated: this is based on its accuracy and performance in the detection process. DGA 
detector system as shown in Figure 1. 
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Trained Dataset 


Naive Bayes 
: (Gaussian, 
AutoML:H20 Multinomial, 
Bernoulli) 


Result (R1) 


| 


Correlated Results 


Figure 1. DGA detector system 


3.3.1. AUTOML: H20 

In foremost productions, AutoML is most important functionalities which automated type of 
algorithms which produce one of the best models. The biggest advantage of machine learning auto H20 is 
needed for finding the best dataset model. Figure 2 is shown AutoML model. 


3.3.2. Random forest (RF) 

RF is used for making a decision in random data sets which uses lot of decision trees to guess the 
result and then it starts the innermost decision trees for voting and selection. Figure 3 refers the RF model. 
Independent decision trees are care- fully designed based on the domain attributes, of which the majority of 
the decision is predicted as the result of the system, as a whole. 


AUTO ML LEADERBORD 


Iteration Training 
Score 
Model | Features + Algorithm + Parameters | 95% 
(prefered Training 
; result) Set 
input: . : 
os Model 2 Features + Algorithm + Parameters | 92% 
Dataset / 
Metrics/ Model 3 | Features+ Algorithm + Parameters | 90% 
Constraints 


85% Test Set 


Model 4 | Features + Algorithm + Parameters 


Features + Algorithm + Parameters 


Figure 2. AutoML model Figure 3. RF model 


3.3.3. Naive Bayes classifier 

Naive Bayes is used for the classification of the different classes to a single class. The malware 
filters of Naive Bayes are based on the benign or malicious is based on the categorically inputted attributes. 
The (1) depicts the model’s outcome. 


P(x|c)P(c) 
P(x) 


P(c|x) = (1) 


were, 

P(c|x) is the posterior probability of class (c, target) give predictor (x, attributes) 
P(c) is the prior probability of class. 

P(x|c) is the likelihood which is the predictors's probability of the given class. 
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P(x) is the predictor probability at the first. 

This classifier adopts that the existence of a specific feature in a session is unconnected to the 
occurrence of any other feature. In this classifier three different models are proposed: Gaussian, Multinomial, 
and Bernoulli. Bernoulli Naive Bayes model assumes the features of the domain to have only two possible 
value, and hence discrete prediction of benign or malicious is made out. The Figure 4(a) represents the 
model, where the line of curve between benign and malicious, is obtained and a clear distinction between 
them is made based on the train and test datasets. 

Multinomial Naive Bayes model assumes the features of the domain to have discrete set of value. 
The labelling of data based on more than one feature is studied and the probability of those features to be 
benign or malicious is predicted as shown in Figure 4(b). Gaussian Naive Bayes model normalizes the test 
and train dataset results and differentiates the benign and malicious in a distinctive way, given a continuous 
range of values to the feature and the possibilities of being the malicious domain as described in the 
Figure 4(c). The features are studied as continuous data in this model. 


0.8 
0.6 = Feature | 
2 0.4 m Feature 2 
3 02 Feature4 Feature 3 
. Feature 3 
0 Feature 2 = Feature 4 
Feature | 
oY % 
$ g % 
~»> “v 2° 
FS s 
2 oa ee 
Feature | 
(a) (b) 


Malicious 


Figure 4. Naive Bayes classifier model (a) Gaussian, (b) Multinomial, and (c) Bernoulli 


3.3.4. Convolutional neural network (CNN) 
CNN is proposed neural network model which implements the text classifier methodology for 


decrease the upfitting by increase in the intake data and failure layer elimination and totaling the regularities. 
CNN model as shown in Figure 5. 


3.3.5. Long short-term memory neural network 

LSTM is proposed neural network model as shown in Figure 6, which are used for making guesses, 
dispensation and categorizing the input data. The features are studied as continuous data in this model. 
The overall architecture in Figure 7 gives better performance than the current one in the test database, we can 
test its actual impact on the application by having sample predictions for a small fraction of our application 
end users. While observing performance, the detector system increases the rate of test users gradually with 
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the new model in the hope that nothing will break. If the new dataset yields better result, the trained database 
will be updated by always returning the prediction of the new model. 


Input Output 
Layer Hidden Layers Layer 


Domain 
Data 


Figure 5. CNN model 


Dataset / Metrics/ Constraints 


Hidden Layer 1 | 


Hidden Layer 2 i 


Figure 6. LSTM model 


Training Dataset 


(Benign/Malicious) = Dae Detecns 


System 


Figure 7. The overall architecture 


4. IMPLEMENTATION 

The detection system for the DGA Botnets is more towards the performance of the deep learning 
and advanced machine learning algorithms used. The performance metrics are measured and represented as 
data graph. 
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4.1. Random forest (RF) 
RF which is a classification of decision trees which is seen for forecasting the result It get around 


92.45% accuracy and test dataset gives us 91.95%. Different features such as vowel ratio in the domain 
names, digit ratio, Figure 8 describes the feature importance levels of different domain keyword. in the graph 


the X axis is the normalised frequency across different features. 
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Figure 8. Accuracy of RF classifier 


4.2. Naive Bayes classifier 
In the implementation, three type of Naive Bayes models are analysed, namely the Gaussian Naive 


Bayes model, multi-nominal Naive Bayes model and Bernoulli Naive Bayes model. Figures 9-11 represent 
the accuracy results of Gaussian, multinomial and Bernoulli Naive Bayes models respectively. 


# (aluate the accuracy 
score_gnb_train = round(accuracy_score(train_y, train_gnb_pred) * 100, 2) 
score_gnb_test = round(accuracy_score(test_y, test_gnb_pred) * 100, 2) 
printC"Accuracy of Gaussian Naive Bayes on training dataset: ", score_gnb_train) 
print("Accuracy of Gaussian Naive Bayes on test dataset: ", score_gnb_test) 
Accuracy of Gaussian Naive Bayes on training dataset: 80.19 
Accuracy of Gaussian Naive Bayes on test dataset: 80.29 


Figure 9. Accuracy of Gaussian Naive Bayes model 


KF (agicuiate the accuracy 

score_mnb_train = round(accuracy_score(train_y, train_mnb_pred) * 100, 2) 
score_mnb_test = round(accuracy_score(test_y, test_mnb_pred) * 100, 2 
print("Accuracy of Multinomial Naive Bayes on training dataset: ", score_mnb_train) 
print("Accuracy of Multinomial Naive Bayes on test dataset: ", score_mnb_test) 


Accuracy of Multinomial Naive Bayes on training dataset: 75.03 
Accuracy of Multinomial Naive Bayes on test dataset: 75.12 


Figure 10. Accuracy of multinomial Naive Bayes model 
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4.3. H20 (Industry based AutoML) 

Here It use H20 which is an industry based AutoML landscapes that runs through the algorithms and 
their guidelines to generate a leading model. Which uses lot of models and compares the models for finding 
the best suitable models of the dataset, which performs 75 percentage accurately every time. Figure 12 
represents the accuracy table of all the algorithms fed to AutoML, the measure of maximum accuracy is 
determined by the value of AuC, were DRF AutoML outperforms with an Auc=.974029. 


# Calculate the accuracy 

score_bnb_train = round(accuracy_score(train_y, train_bnb_pred) * 100, 2) 
score_bnb_test — round(accuracy_score(test_y, test_bnb_pred) * 100, 2 
printC"Accuracy of Bernoulli Naive Bayes on training dataset: ", score_bnb_train) 
printC"Accuracy of Bernoulli Naive Bayes on test dataset: ", score_bnb_test) 


Accuracy of Bernoulli Naive Bayes on training dataset: 64.89 
Accuracy of Bernoulli Naive Bayes on test dataset: 65.01 


Figure 11. Accuracy of Bernoulli Naive Bayes model 


model_id auc logloss mean_per_ciass_error rmse mse 

DRF_1_AutoML_20181213_180050 0974029 0202789 0.0872518 0247939 0061474 
StackedEnsemble_BestOfFamily AutoML_20181213 180050 0.973925 0.215331 0.0869145 0.251191 0.0630969 
DRF_1_AutoML_20181213 113624 0.973167 0.205802 0.0878242 0.249722 0.0623611 
XRT_1_AutoML_20181213_180050 0.972285 0.222562 0.0872121 0.252391 0.0637013 
GBM_4_AutoML_20181213_180050 0.971904 0.212622 0.0912003 0.253394 0.0642084 
GBM_3_AutoML_20181213_180050 0.968063 0227041 0.0996403 0262608 0.068963 
GBM_2_AutoML_20181213_180050 0965728 0234857 0.104301 0.267685 00716552 
GBM_1_AutoML_20181213_180050 0963299 0242445 0.107354 0.272651 00743284 
GLM_grid_1_AutoML_20181213 180050 _model_1 0.91011 0.389243 0.164302 0.345246 0.119195 


Figure 12. Accuracy measure of H20 


4.4. Convolutional neural network (CNN) 

It used many CNN models here with lot of unique structure and configurations. The best CNN 
model It got is 1D and it should us accuracy around 80% with data testing. Figure 13 describes the accuracy 
measure of CNN, wherein the accuracy of two results, namely training dataset and test dataset are 
comparatively plotted, where X axis represents the percentage of accuracy for each epoch (along Y axis). 


4.5. Long short-term memory neural network (LSTM) 

LSTM is classified and processed and also used for making predictions of the layer-by-layer 
approach accuracy what It got is around 98%. Figure 14 represents the accuracy measure of LSTM, wherein 
the accuracy of training dataset and test dataset are comparatively plotted, where X axis represents the 
percentage of accuracy for each Epoch (along Y axis). 


Model accuracy Model accuracy 
0.84 z 
—— tran 
aso —— Test 
agg 
os7 
3 ass 
| 
8 09s 
as4 
93 
ag2 
00 o5 10 15 20 25 30 35 40 0 2 4 6 8 » P Mu 
Epoch Epoch 
Figure 13. Accuracy measure of CNN Figure 14. Accuracy measure of LSTM 
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Figure 15 tabulates the score of each model, wherein by working and doing comparison of all the 
models accuracy percentage. It is shown that both the AutoMl and LSTM will have best accuracy rate in 
DGA Detection but H20 requires high time to train the data set. The future scope of the work in detection of 
DGA would be reducing the training data set, and increasing the false positives rate beyond the designed 
model. This could be achieved by the use of most advanced and recent deep learning algorithms that could 
categorize the domains efficiently. This work would significantly impact the upcoming future works on real 
time DGA botnet detection. 
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Figure 15. Accuracy comparison of the used algorithms 


5. CONCLUSION 

The work presents an approach to classify DGA generated domains using deep learning that has a 
technical advantage as they are unsupervised learning real-time classifiers and featureless. Therefore, there is 
no need to generate features manually, instead the features are self-extracted during the training. The LSTM, 
AutoML, RF, CNN, Naive Bayes are the selected algorithms for the work. The DGA families have 
concatenated the words randomly from the dictionaries, which had to be trained as the dataset. Analysis of 
the functional interpretability is worked on each layer of the classifier; these layers are different algorithm. 
RF makes the primary decision using the decision tree of malicious domain, then the Naive Bayes classifier 
on secondary decision making. LSTM and CNN consolidation of the results of the other decision trees is 
made in the implementation. Thus, the experimentation results show that open-source dataset has tested the 
performance results with 90% false positive rates. 
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